fix: handle DrainIngress in fake_data_generator to unblock graceful shutdown#2515

Merged
jmacd merged 14 commits into open-telemetry:main from sjmsft:bug_2511
Apr 9, 2026

Conversation

@sjmsft
Contributor

@sjmsft sjmsft commented Apr 2, 2026

Change Summary

The "Ack nack redesign" PR (3dca283) introduced a two-phase DrainIngress/ReceiverDrained shutdown protocol but missed updating the fake_data_generator receiver. Without the DrainIngress handler, the message falls into the _ => {} catch-all, notify_receiver_drained() is never called, the pipeline controller never removes the receiver from its pending set, and after the deadline expires it emits DrainDeadlineReached. This was causing pipeline-perf-test-basic to fail consistently.
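The failure mode can be illustrated with a minimal, self-contained sketch. The enum, handler, and `drained` flag below are simplified stand-ins for the runtime's actual API, mirroring only the names mentioned in this PR:

```rust
// Minimal sketch of the dispatch bug described above. `NodeControlMsg`
// and the `drained` flag are simplified stand-ins for the real runtime API.
#[allow(dead_code)]
enum NodeControlMsg {
    DrainIngress { deadline_ms: u64 },
    Shutdown,
    CollectTelemetry,
}

// Returns true when the receiver should exit its event loop.
fn handle(msg: NodeControlMsg, drained: &mut bool) -> bool {
    match msg {
        // The arm the original code was missing: without it, DrainIngress
        // fell into the `_ => {}` catch-all, notify_receiver_drained() was
        // never called, and the controller eventually hit DrainDeadlineReached.
        NodeControlMsg::DrainIngress { .. } => {
            *drained = true; // stands in for notify_receiver_drained()
            true
        }
        NodeControlMsg::Shutdown => true,
        _ => false,
    }
}

fn main() {
    let mut drained = false;
    let exit = handle(NodeControlMsg::DrainIngress { deadline_ms: 200 }, &mut drained);
    println!("exit={exit} drained={drained}"); // exit=true drained=true
}
```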

What issue does this PR close?

pipeline-perf-test-basic unit test is failing.

How are these changes tested?

fake_data_generator and runtime_control_metrics tests were executed.

Are there any user-facing changes?

No, fake_data_generator is an internal test/load-generation receiver, not a user-facing component.

@sjmsft sjmsft requested a review from a team as a code owner April 2, 2026 16:46
@github-actions github-actions bot added the rust Pull requests that update Rust code label Apr 2, 2026
@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 99.10714% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.39%. Comparing base (e1742e0) to head (108bc13).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2515      +/-   ##
==========================================
+ Coverage   88.37%   88.39%   +0.02%     
==========================================
  Files         620      622       +2     
  Lines      228395   230170    +1775     
==========================================
+ Hits       201836   203457    +1621     
- Misses      26035    26189     +154     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 90.22% <99.10%> (+0.01%) ⬆️
query_abstraction 80.61% <ø> (ø)
query_engine 90.74% <ø> (ø)
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 52.45% <ø> (ø)
quiver 92.27% <ø> (ø)

@lalitb
Member

lalitb commented Apr 2, 2026

The sequencing here looks off. In graceful shutdown the runtime does:

DrainIngress -> ReceiverDrained -> downstream Shutdown.

This change makes fake_data_generator do

DrainIngress -> ReceiverDrained -> wait for Shutdown,

but that Shutdown is not part of the normal post-drain receiver path. For this receiver, once ingress is stopped there is no receiver-local work left to preserve, so it should exit directly on DrainIngress rather than report drained and then block waiting for another shutdown message.

Member

@lalitb lalitb left a comment


Please go through the comment here.

@lalitb
Member

lalitb commented Apr 2, 2026

The correct fix should be:

DrainIngress -> notify_receiver_drained() -> return TerminalState immediately

Something like (not tested):

Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    effect_handler.notify_receiver_drained().await?;
    return Ok(TerminalState::new(deadline, [self.metrics.snapshot()]));
}

@lalitb
Member

lalitb commented Apr 2, 2026

The fix now looks correct. However, the CI failures point to a shutdown race in fake_data_generator that is easy to hit on slower runners. The test config uses signals_per_second = 1, so the receiver can sleep for close to 1 second between sends, while the shutdown deadline in test_telemetry_registries_cleanup is only 200 ms. That means DrainIngress can arrive while the receiver is asleep, the runtime can move into forced shutdown before the receiver handles it, and notify_receiver_drained().await? can then fail with "Channel is closed".

One option could be to address this in two places:

  • make the rate-limit sleep interruptible, since that looks like the root cause here.
if signals_per_second.is_some() {
    let remaining_time = wait_till - Instant::now();
    if remaining_time.as_secs_f64() > 0.0 {
        tokio::select! {
            biased;

            ctrl_msg = ctrl_msg_recv.recv() => {
                // handle DrainIngress / Shutdown during the rate-limit wait
                // using the same control-message handling as the main loop
            }

            _ = sleep(remaining_time) => {}
        }
    }
}
  • make notify_receiver_drained() best-effort on the terminal DrainIngress path, so a late control-plane teardown does not turn shutdown into an error.
Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    let _ = effect_handler.notify_receiver_drained().await;
    return Ok(TerminalState::new(deadline, [self.metrics.snapshot()]));
}

@sjmsft sjmsft requested a review from lalitb April 2, 2026 23:39
@lquerel
Contributor

lquerel commented Apr 4, 2026

@sjmsft

[in addition to the sequence described by @lalitb ]

The exception is deadline-forced shutdown. If the drain deadline expires before the receiver reports drained, the runtime sends NodeControlMsg::Shutdown { deadline, reason } to any still-pending receivers.
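That deadline-forced fallback can be sketched as follows. This is an illustrative, self-contained model of the controller-side decision lquerel describes, not the runtime's actual types (the function name and message strings are invented for the example):

```rust
use std::time::{Duration, Instant};

// Illustrative sketch of the deadline-forced path: the pipeline controller
// waits for ReceiverDrained until the drain deadline expires, then falls
// back to sending Shutdown to any still-pending receivers.
fn controller_step(drain_started: Instant, deadline: Duration, receiver_drained: bool) -> &'static str {
    if receiver_drained {
        "all drained: proceed to downstream Shutdown"
    } else if drain_started.elapsed() >= deadline {
        "DrainDeadlineReached: send NodeControlMsg::Shutdown to pending receivers"
    } else {
        "still waiting for ReceiverDrained"
    }
}

fn main() {
    let started = Instant::now();
    // Deadline already expired and the receiver never reported drained.
    println!("{}", controller_step(started, Duration::ZERO, false));
    // Receiver reported drained within the deadline.
    println!("{}", controller_step(started, Duration::from_secs(60), true));
}
```

This is why a receiver that handles DrainIngress late (or not at all) still gets torn down, but via the forced path rather than the graceful one.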

jmacd and others added 3 commits April 8, 2026 12:41
main.rs but the Dockerfile was not updated to copy the file from the
otel-arrow build context, breaking the Docker build.

Co-authored-by: Copilot <[email protected]>
Verifies the receiver handles DrainIngress promptly even while
sleeping in a rate-limit interval. Without the DrainIngress handler
the receiver would stall until the drain deadline expired, causing
DrainDeadlineReached.

Co-authored-by: Copilot <[email protected]>

Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    let _ = effect_handler.notify_receiver_drained().await;

Member

@lalitb lalitb Apr 8, 2026

Suggested change
let _ = effect_handler.notify_receiver_drained().await;
effect_handler.notify_receiver_drained().await?;

    _ = sleep(remaining_time) => {}
}
Member


The current sleep makes the receiver responsive to DrainIngress/Shutdown, but it also changes the rate-limiting behavior: any non-terminal control message handled as Ok(None) exits the sleep immediately, so the next batch can be sent before the original wait_till. We should replace lines 445-456 above with:

// Keep the original sleep deadline even if non-terminal control
// messages arrive. Only DrainIngress/Shutdown should interrupt
// the rate-limit wait early.
let sleep_until = sleep(remaining_time);
tokio::pin!(sleep_until);

loop {
    tokio::select! {
        biased;
        ctrl_msg = ctrl_msg_recv.recv() => {
            if let Some(terminal) =
                handle_control_msg(ctrl_msg, &effect_handler, &mut self.metrics).await?
            {
                return Ok(terminal);
            }
        }
        _ = &mut sleep_until => break,
    }
}

Contributor Author

@sjmsft sjmsft Apr 8, 2026


Please check the new changes + new test.

@lalitb lalitb self-requested a review April 8, 2026 22:00
github-merge-queue bot pushed a commit that referenced this pull request Apr 8, 2026
# Change Summary
Adds a necessary Dockerfile line to fix the build.
Adds a test to our CI/CD workflow, which would have caught this in
#2597.
This was going to block #2515

---------

Co-authored-by: Copilot <[email protected]>
Member

@lalitb lalitb left a comment


Thanks.

@jmacd jmacd enabled auto-merge April 8, 2026 23:49
@jmacd jmacd added this pull request to the merge queue Apr 9, 2026
Merged via the queue into open-telemetry:main with commit 9b4b8dc Apr 9, 2026
69 of 70 checks passed

Labels

rust Pull requests that update Rust code

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

pipeline-perf-test-basic unit test is failing

4 participants